Workbook

  • You can run the code in these slides by opening 02-data-frames.R and following along.
  • ALL of the code you see on the screen today is in your workbook.
  • You will learn more if you see the code run for yourself.
    • This is technically an untested hypothesis. Sorry.
  • There are a few code examples I have had to comment out in order for my slides to build correctly. Feel free to uncomment most of this code (unless it is otherwise noted).
    • RStudio Shortcut - [CTRL]+[SHIFT]+c

Learning Objectives

  • Data Frames
  • Working Directory
  • Import CSV
  • Filtering and Using Data Frames
  • Getting Help

Data Frames

What is a Data Frame?

  • Historically - called a data.frame.
  • Vectors are 1-dimensional (length).
  • Vectors can contain only one data type (integer, character, date).
    • Vectors containing multiple data types are characters.
  • Data frames are multi-dimensional.
    • Today, 2-dimensional data frames only.
    • There is no limit to the dimensionality of your data.
  • Data frames can contain multiple data types.
    • This is enormously useful for analytical projects.
  • This data structure is one of R's key competitive advantages.
    • Pandas!

Your First Data Frame

  • R comes with many demo data sets.
  • Most of these are data frames.
## Loads the iris data set. It provides the measurements (cm)
## of the sepal length and width, and petal length and width,
## respectively, for 50 flowers from 3 species of iris.
data(iris)

DEMO: Four Ways To View A Data Frame:

  1. Use the RStudio Interface. Top-Right Pane
  2. Use the view command. View(iris).
  3. Enter the name of the data fram into the REPL console.
    • Easiest, but inefficient for large data sets.
  4. Use the head() and tail() commands.

Data Frame Dimensions

## How many rows and how many columns are in iris?
dim(iris)

Returns:

  • [1] 150 5
  • Always: Rows, Columns
## What are the name of the columns?
names(iris)

Returns:

  • [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

Use A Column

## It is EASY to access a single column of data for use in analysis.
iris$Sepal.Length

Returns:

  • Nahh. We need to see a demo.
## The Sepal.Length column is numeric. We can find it's average.
mean( iris$Sepal.Length )

Returns:

  • [1] 5.843333
  • R is picky. Capitalization ALWAYS matter.
  • Because this lets us work with a single vector, we can use the indexing we learned earlier.

Lower Case Column Names

## I'm lazy - I hate typing upper-case letters in column names.
## The tolower function can be used to make them lower case.
## We can then assign these new names to the col names of the
## iris data object. This makes life better.
names(iris) <- tolower( names(iris) )

names(iris)

Returns:

  • [1] "sepal.length" "sepal.width" "petal.length" "petal.width" "species"
  • The spaces in the code are for clarity. They are not required.

Practice: Load A Data Frame

## Q1: Load the 'mtcars' data set
##     mtcars is a data set about cars from the late 70's.


## Q2: What is the average Miles Per Gallon (mpg) and the standard deviation of mpg?
##     The 'mean' and 'sd' functions will be helpful for figuring this out.


  • This is the last non-boat related example.
  • From here on out, we're going to get our feet wet in some Titanic data.

Importing Data

Working Directory

  • Like most programming languages, R has a working directory.
    • This is "root" folder of a project.
    • This concept is critical when using external files.
  • Fortunately, you do not have to do this manually!
## Knowing the current working directory is VERY important for
## interacting with external files. R prefers relative paths
## to external files. Avoid absolute paths when you can.
getwd()

Set Working Directory

Ways to set the Working Directory:

  1. Code (Example code below.)
  2. RStudio has a menu option for this in the "Session" menu.
  3. RStudio Project! (This is the best option)
    • In fact, it is so good, we are going to demo it.
    • And then you have to do it on your computer.
## Sets the working directory to a folder called sinking-boats.
## You don't want to run this unless it is a real folder.
## This isn't. So don't run it, until you modify it.
setwd("/home/andy/sinking-boats")
  • If you do this in code, I recommend using UNIX-style slashes (/).

Practice: Create A New R Project

## Create a new RStudio Project in the folder where you
## unzipped the workshop files. R will switch to this project
## automatically.


NOTE:

  • If you aren't sure you are doing this correctly, let us know.
  • Everyone needs to complete this done successfully.

Import CSV

Directory List

  • Because we set the working directory, we can use relative paths in our code.
## To see the list of files in the current working directory.
dir()
  • If you run this and don't see the .R files we are using, TELL US!


## To see the list of files in the data folder.
dir(path="data")
  • In the first example, we implicitly used the default path, which is ".", AKA the current working directory.

Import Titanic Data

## Imports the Titanic training data set from the data folder.
## After running this, you should see the tit_train data set in the
## Environment tab in RStudio.
ti_train <- read.csv("data/train.csv", as.is=TRUE)

NOTE:

  • Tell us if you are having trouble with this.
  • Setting as.is = TRUE lets us avoid importing the data in Factors.
    • We haven't talked about factors yet, but they often cause more problems than they are worth.
  • Why ti_train? This data set was originally set up by Kaggle for introducing people to Machine Learning. I wanted to make it clear it is "Titanic" data, but keep the training data set distinct from the test data set.

Practice: Explore Titanic Data

  • Use one of the tools we have leared today to look at ti_train.
  • The str() command is another way to explore the data.
    • This is one of my favorite commands because it is compact and works well from the command-line interface I prefer.
  • Documentation for the Kaggle Titanic data is available in data/README.md
    • Open this file with RStudio!


## Compactly display the structure of an arbitrary R object.
## This works on objects other than data frames, but that is
## what most people will use it for most often.
str(ti_train)

HELP!

What Was That Command?

## When you can't remember the name of the command you
## are looking for, RStudio offers command and variable
## completion to help make life easier.
is.

Local Help

## Sometimes you need help with a specific command.
## R's built-in help is good. Not perfect. But good.
help(round)

## Hackers are universally lazy (that's why we became
## hackers). Typing help is too much work. A question
## mark accomplishes the same thing.
?round
  • RStudio makes it easy to find and use R's built-in help.
  • I suppose most of you will just use that.

Remote Help